The second task of this project is an in-depth analysis of the power consumption data set from a data scientist's perspective, to be presented to management on behalf of a prospective IoT home developer. This is accomplished through data visualization and time series regression modeling.
=============================================================================================== The Following is the Plan of Attack =============================================================================================== Step 1 - Subsetting the Data and Granularity. Subset the data into meaningful time periods for insights into sub-metered energy consumption, applying appropriate granularity. Such insights could serve as an incentive to attract potential home buyers interested in “smart home” technology.
Step 2 - Exploration of the Data Using Visualization Techniques in R. The most information-rich visualizations are selected for presentation during this process.
Step 3 - Time Series Regression Models for Both Seasonal and Non-Seasonal Forecasting. Three different time series regression models are developed and applied to both seasonal and non-seasonal forecasting.
Step 4 - Summary of The Analysis and Recommendations Report to The Client.
| Task One (Goals and Objectives) The initial report on Task One submitted to the client already defined the business objectives. The high-level business objectives outlined in the Initial Presentation Report to the IoT Client are: |
| 1. Determine whether the installation of sub-metering devices to measure power consumption can translate into economic incentives for homeowners and the client. 2. Determine what kinds of analytics and visualizations could be obtained from the data about energy consumption. 3. The IoT client’s goal is to offer highly efficient Smart Homes that provide customers with power usage analytics, in the hope that these analytics will help grow their business. |
Analytical Focus Points: (i) Sub-metered energy consumption data that provides enough granularity to uncover trends in the sub-metered areas of the home. (ii) Peak energy usage that can be identified, allowing for possible behavior modification to take advantage of off-peak electricity rates if offered by the local electric provider. (iii) Patterns of energy usage that can be used to predict future usage.
## (Package start-up messages trimmed: DBI, lattice, survival, Formula, ggplot2, Hmisc, tidyverse 1.2.1, lubridate, colorspace, grid, data.table, VIM, psych, ggfortify, forecast, xts, quantmod, and plotly were loaded, along with the usual masking notes from dplyr, data.table, Hmisc, and plotly.)
2.1 Create a database connection to extract data
## [1] "iris" "yr_2006" "yr_2007" "yr_2008" "yr_2009" "yr_2010"
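A sketch of the connection step, assuming the DBI front end with a MySQL driver; the database name, host, and credentials below are placeholders, not the client's actual details:

```r
library(DBI)

# Hypothetical connection details -- substitute the actual host and credentials.
con <- dbConnect(RMySQL::MySQL(),
                 dbname   = "power_db",        # placeholder
                 host     = "db.example.com",  # placeholder
                 user     = "analyst",
                 password = "********")

# List the available tables; the output above shows the yearly tables plus iris.
dbListTables(con)
```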
2.2 Use the dbGetQuery function to download tables 2007 through 2010 data sets with the specified attributes
## Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported
## as numeric
## Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported
## as numeric
## Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported
## as numeric
## Warning in .local(conn, statement, ...): Unsigned INTEGER in col 0 imported
## as numeric
Create a multi-year data frame to serve as the primary data frame for this project. Using the dplyr function bind_rows(), combine the tables (ONLY the data frames that span an entire year: 2007, 2008, and 2009).
2.3 Gather Summary Statistics Mean, mode, standard deviation, quartiles, and characterization of the distribution. With the data loaded, the summary() function is used to look at the structure of the data set.
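The combination step itself is one line with dplyr, assuming the yearly data frames downloaded above:

```r
library(dplyr)

# Only 2007-2009 span an entire calendar year; 2006 and 2010 are partial.
newDF <- bind_rows(yr_2007, yr_2008, yr_2009)
```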
## Warning: 'glance.data.frame' is deprecated.
## See help("Deprecated")
## # A tibble: 1 x 4
## nrow ncol complete.obs na.fraction
## <int> <int> <int> <dbl>
## 1 1569894 10 1569894 0
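Base R has no built-in statistical mode function, so a small helper is needed. The sketch below, with a toy stand-in vector and a hypothetical helper name `stat_mode`, shows how these statistics can be gathered per feature:

```r
# Hypothetical helper: R's mode() reports storage type, not the statistical mode.
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

x <- c(0, 0, 0, 1, 2, 2, 0, 3)  # toy stand-in for a sub-meter column

mean(x)       # arithmetic mean -> 1
sd(x)         # standard deviation
quantile(x)   # min, quartiles, max
stat_mode(x)  # most frequent value -> 0
```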
2.4 Dealing with Missing Values (NA)
From the summary output of the data frame ‘newDF’, we can see that there are 1,569,894 observations. The summary statistics of the features show that there are no missing values (NA) because they were removed during earlier cleansing.
## id Date Time Global_active_power Global_reactive_power
## 1 1 2007-01-01 00:00:00 2.580 0.136
## 2 2 2007-01-01 00:01:00 2.552 0.100
## 3 3 2007-01-01 00:02:00 2.550 0.100
## 4 4 2007-01-01 00:03:00 2.550 0.100
## 5 5 2007-01-01 00:04:00 2.554 0.100
## 6 6 2007-01-01 00:05:00 2.550 0.100
## Global_intensity Voltage Sub_metering_1 Sub_metering_2 Sub_metering_3
## 1 10.6 241.97 0 0 0
## 2 10.4 241.75 0 0 0
## 3 10.4 241.64 0 0 0
## 4 10.4 241.71 0 0 0
## 5 10.4 241.98 0 0 0
## 6 10.4 241.83 0 0 0
Here, data munging of newDF creates a ‘DateTime’ attribute by combining the ‘Date’ and ‘Time’ columns within the data frame. This data processing step is essential to get the data ready for exploratory analysis and modeling.
3.1 Using the cbind() function to combine the ‘Date’ and ‘Time’ features into a new ‘DateTime’ feature. The header name for the new ‘DateTime’ attribute in the 11th column is also set. The cryptic codes used for the format are explained in R’s help section (type ?strptime).
We then eliminate unwanted columns: the “Date”, “Time”, and “id” features are excluded to create a new data frame.
3.2 Convert the DateTime Data Type to a Time Format R Understands Using the POSIXct() Function. After combining the Date and Time columns, the DateTime feature is of the character class, so the POSIXct() function is used to convert it into the proper data class.
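A minimal sketch of the two munging steps, run here on a two-row stand-in for the real newDF:

```r
# Two-row stand-in for the real newDF.
newDF <- data.frame(id   = 1:2,
                    Date = c("2007-01-01", "2007-01-01"),
                    Time = c("00:00:00", "00:01:00"),
                    Global_active_power = c(2.580, 2.552))

# 3.1 Combine Date and Time into a new DateTime column.
newDF <- cbind(newDF, DateTime = paste(newDF$Date, newDF$Time))

# Drop the columns that are no longer needed.
new_DF <- newDF[, !(names(newDF) %in% c("id", "Date", "Time"))]
names(new_DF)  # "Global_active_power" "DateTime"
```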
## Warning in strptime(xx, f, tz = tz): unknown timezone '%Y/%m/%d %H:%M:%S'
## Warning in as.POSIXct.POSIXlt(x): unknown timezone '%Y/%m/%d %H:%M:%S'
## Warning in strptime(x, f, tz = tz): unknown timezone '%Y/%m/%d %H:%M:%S'
## Warning in as.POSIXct.POSIXlt(as.POSIXlt(x, tz, ...), tz, ...): unknown
## timezone '%Y/%m/%d %H:%M:%S'
The data set is from a house in Europe, therefore the time zone must be set correctly to match the data source’s tz.
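The conversion itself is straightforward; the specific zone name "Europe/Paris" below is an assumption standing in for whichever European tz identifier matches the data source:

```r
# Parse the combined string as POSIXct, then attach the source time zone.
dt <- as.POSIXct("2007-01-01 00:00:00", format = "%Y-%m-%d %H:%M:%S", tz = "GMT")
attr(dt, "tzone") <- "Europe/Paris"  # assumed European zone for the data source
class(dt)  # "POSIXct" "POSIXt"
```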
## [1] "POSIXct" "POSIXt"
## 'data.frame': 1569894 obs. of 8 variables:
## $ DateTime : POSIXct, format: "2007-01-01 01:00:00" "2007-01-01 01:01:00" ...
## $ Global_active_power : num 2.58 2.55 2.55 2.55 2.55 ...
## $ Global_reactive_power: num 0.136 0.1 0.1 0.1 0.1 0.1 0.096 0 0 0 ...
## $ Global_intensity : num 10.6 10.4 10.4 10.4 10.4 10.4 10.4 10.2 10.2 10.2 ...
## $ Voltage : num 242 242 242 242 242 ...
## $ Sub_metering_1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sub_metering_2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sub_metering_3 : num 0 0 0 0 0 0 0 0 0 0 ...
For deeper insight through visualization and analysis, we subset the data into year, quarter, month, week, day, weekday, hour, and minute periods. This should provide the granularity needed for visualization.
4.1 Granularity - Subsetting and Meaningful Time Periods One of the goals of subsetting for visualizations is to adjust granularity to maximize the information to be gained. Granularity describes the frequency of observations within a time series data set. From the data description we know that the observations were taken once per minute over the period of almost 4 years. That’s over 2 million observations from the raw data prior to initial data munging. The new data set ‘new_DF’ with over 1.5 million observations needs to be subset into meaningful time periods for better visualizations and insight analysis.
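A sketch of the subsetting columns, assuming lubridate and the `new_DF` frame built above (the derived column names here are illustrative):

```r
library(lubridate)

# Derive the time-period columns used for subsetting at each granularity.
new_DF$year    <- year(new_DF$DateTime)
new_DF$quarter <- quarter(new_DF$DateTime)
new_DF$month   <- month(new_DF$DateTime)
new_DF$week    <- week(new_DF$DateTime)
new_DF$day     <- day(new_DF$DateTime)
new_DF$weekday <- wday(new_DF$DateTime, label = TRUE)
new_DF$hour    <- hour(new_DF$DateTime)
new_DF$minute  <- minute(new_DF$DateTime)

# Example: the second week of 2008, used in the next subsection.
houseWeek <- subset(new_DF, year == 2008 & week == 2)
```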
4.1.1 Week Visualization and Analysis - Second Week of 2008
Plot Week Visualization - Second Week of 2008
4.1.2 Day Visualization and Analysis - Day 9 of January 2008
First, looking at a single-day visualization of all three Sub-meters using scatter lines, we can see energy usage peaking at different times of the day in the Sub-meter 3 area. The kitchen area only has high energy consumption between 5 PM and 6 PM. As expected, kitchen appliances like the cooktop burners, oven, and microwave are used for cooking when occupants are home.
4.1.3 Minute Visualization and Analysis - January 9th (Granularity Adjusted)
Sub-meter 3 With the granularity adjusted to a 10-minute frequency, we get a much clearer picture of the power consumption on January 9th, 2008. 1. The plot shows higher energy usage in Sub-meter 3 between 6 AM and 2 PM, with another peak between 8 PM and 11 PM. 2. Considering the seasonality of the period, these peaks likely represent water heater usage, because the home will not need AC in the winter period. 3. The 6:30 to 8:30 AM double peaks may be due to the water heater being used for bathing. 4. It is also possible that the homeowner uses more hot water for every running-water activity in the house due to the cold weather. 5. The high energy usage at night is also attributed to the water heater.
Sub-meter 2 6. There is a small amount of energy usage about every 2 to 2 1/2 hours throughout the day in the laundry area. There may be an energy regulator device or energy-saving appliances in the laundry area.
Sub-meter 1 The homeowner uses more energy for kitchen appliances between 5:40 AM and 6:30 PM, but only once during the day. With the granularity adjusted to 10 minutes, it becomes clearer that this may be due to cooking in the kitchen area.
Comparison of Same Days in Two Different Years
Pattern: 1. Comparing January 9th of 2008 to the same date in 2009, there seems to be a pattern of high energy peaks in Sub-meter 3 during the same periods of the day. Higher energy usage concentrates in the early morning and around 6:30 PM. 2. Small energy usage is observed in the laundry area every 2 to 2 1/2 hours. 3. The homeowner also seems to use more energy in the kitchen area once a day, but at different times.
5.0 CREATING VISUALIZATIONS OF A RANDOM SUMMER WEEK, DAY, AND MINUTES IN 2007 & 2009
5.1 Creating a visualization for a random week (insight into mid-summer)
5.2 Creating a visualization for a random day.
Looking with the granularity adjusted for a clearer picture. Insights, Summer 2009: a. Sub-meter 3 energy usage peaked often during day and night in summer, as expected, possibly due to more AC usage. b. Kitchen appliance energy usage also increased to about 6 times during the day, but not as often as the AC/water heater. c. Laundry energy usage peaked once.
5.3 Comparing Insights Between Summer 2007 and 2009:
| Opportunities/Recommendation: |
| 1. Assuming the local power provider offers lower off-peak rates in the evening from 7 PM to 11 PM, this homeowner could save on electricity by shifting more energy usage, particularly laundry, water heater, and AC, to those off-peak hours. |
| 2. Energy usage does seem to decrease during the day; the lowest winter energy usage was recorded between 7 PM and 11 PM on all sub-meters. |
| 3. Mid-afternoon shows higher energy consumption on all sub-meters. Reducing energy usage during this period may be another savings opportunity. |
| 4. Compare High Energy Consumption between Summer and Winter |
| Here, we look for insights on energy consumption by day of the week during two critical periods of the year associated with high energy consumption. This valuable information may reveal potential opportunities for behavior modification by the homeowner. |
7.1 With the initial visualizations completed, it’s time to prepare the data for time series analysis: store the data frame(s) as time series with the appropriate start, end, and frequency. The data is subset and then a time series object is created using the ts() function.
7.1.1 Sample Data: choosing Sub-meter 3 with a frequency of 52 weekly observations per year (frequency of 52, start date of January 2007).
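A self-contained sketch of the ts() step; 157 synthetic weekly values stand in here for the weekly Sub-meter 3 means (157 matches the series length shown in the decomposition output later):

```r
set.seed(42)
# Synthetic stand-in for the weekly mean of Sub_metering_3 (157 weeks).
weekly_sm3 <- pmax(0, rnorm(157, mean = 6, sd = 3))

# Frequency 52 (weekly observations per year), starting the first week of 2007.
tsSM3_070809weekly <- ts(weekly_sm3, frequency = 52, start = c(2007, 1))

frequency(tsSM3_070809weekly)  # 52
start(tsSM3_070809weekly)      # 2007 1
```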
Produce time series plots on Sub-meter 3
Other Visualization Plots on Sub-meter 3
Considering the plots above (autoplot with labels and color), the energy consumption patterns observed when comparing years 2007 to 2009, both in the bar charts and scatter lines of the previous analysis, do indeed repeat over time for Sub-meter 3.
7.1.2 Sub-meter 1 Time Series Visualization With Frequency of 52 Same Time Period
From the plots above, we can see that the same low energy consumption patterns observed when comparing years 2007 to 2009, both in the bar charts and scatter lines of the previous analysis, do indeed repeat here for Sub-meter 1.
7.1.3 Sub-meter 2 Time Series Visualization With Frequency of 52 Same Time Period
The same pattern is observed for Sub-meter 2 in the graph above.
Focusing on Sub-meter 3, which accounts for a majority of the total sub-metered energy consumption, we use a linear regression model, lm(), for prediction and forecasting. Let’s first fit a linear model to the weekly data previously created. Before using these models to forecast future energy usage, let’s look at some of the assumptions of a linear regression model to determine whether a linear model is appropriate for this subset of the time series data.
8.1 Using linear regression (lm) to model and predict the trend in time series data. Create three different time series linear models for three different time periods using the tslm() and forecast() functions. We’ll also forecast the trend of each time series model created.
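The fitting-and-forecasting pattern, assuming the forecast package; the object names are illustrative, with `tsSM3_070809weekly` standing for the weekly Sub-meter 3 ts object created with ts() above:

```r
library(forecast)

# Fit a time series linear model with trend and seasonal dummy terms.
fitSM3 <- tslm(tsSM3_070809weekly ~ trend + season)

# Forecast 20 periods ahead with 80% and 90% prediction intervals.
forecastfitSM3 <- forecast(fitSM3, h = 20, level = c(80, 90))
plot(forecastfitSM3, ylab = "Watt-Hours", xlab = "Time")
```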
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.346 0.0194 6.53 1.06 0.395 53 -485. 1078. 1243.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Calculate RMSE
## [1] 5.314972
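RMSE can be computed directly from the model residuals; a self-contained sketch using R's built-in cars data set in place of our weekly model:

```r
# RMSE = sqrt(mean(squared residuals)); illustrated on the built-in cars data.
fit  <- lm(dist ~ speed, data = cars)
rmse <- sqrt(mean(residuals(fit)^2))
rmse
```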
a) The summary() output provides a quick assessment of the model, but glance() provides an easy-to-read tabular output. b) Interestingly, the R-squared of 0.346 is very low, and the p-value of 0.395 does not indicate that the predictors, individually or jointly, are statistically significant. With an R-squared this far from 1, the linear regression model is not a good fit. c) The RMSE of 5.31 indicates the absolute fit of the linear regression model to the data, i.e., how close the observed data points are to the predicted values. The lower the RMSE, the better the model fits.
## # A tibble: 53 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.3 3.40 3.32 0.00126
## 2 trend -0.0289 0.0121 -2.38 0.0190
## 3 season2 -3.72 5.00 -0.745 0.458
## 4 season3 -3.69 5.00 -0.739 0.461
## 5 season4 -9.67 5.00 -1.93 0.0557
## 6 season5 -3.64 4.99 -0.728 0.468
## 7 season6 -9.61 4.99 -1.92 0.0571
## 8 season7 -9.58 4.99 -1.92 0.0578
## 9 season8 -9.55 4.99 -1.91 0.0585
## 10 season9 -9.52 4.99 -1.91 0.0593
## # ... with 43 more rows
The output from the tidy() function tabulates the coefficient estimates with the corresponding standard error and the p-value for the 52 weeks (season) period.
8.2.1 Forecasting for Sub-meter 3. With the above analysis of our linear model completed, we can use it to make predictions of energy consumption on Sub-meter 3 for a future time period using the forecast() function.
Change the confidence levels and plot only the forecast portion that is above zero.
Insight Analysis: 1. To make a forecast with the Sub-meter 3 linear model, we pass the model, the number of time periods, and the confidence level for the prediction interval. 2. The forecast plot above shows a trend line of the predicted values with the 80% and 90% prediction intervals. Sub-meter 3 is predicted to continue with high energy peaks but lower energy consumption in the future; the trend shows a pattern skewed from the historical trends analyzed. 3. Actual energy consumption that falls outside a prediction interval could signal a potential issue with an appliance. 4. This forecast shows that this area of the house will continue with high energy consumption if no measure is taken to address energy usage. 5. The dark grey areas are the 80% prediction intervals and the light grey the 90% prediction intervals. The dark blue line is the average of the forecasted points.
8.2.2 Additional Visualizations, Analysis and Forecasting for Sub-meters 1 and 2:
Sub-meter 1 with same frequency, time period and confidence levels used for Sub-meter 3
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.330 -0.00532 4.29 0.984 0.516 53 -419. 946. 1111.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Calculate RMSE
## [1] 3.490013
Forecasting for Sub-meter 1
Fine Tuning
Insight Analysis: 1. The forecast plot above shows a trend line of the predicted values with the 80% and 90% prediction intervals. Sub-meter 1 is predicted to continue having the lowest energy consumption among the three Sub-meters; the trend shows a pattern similar to the historical trends analyzed. 2. Actual energy consumption that falls outside a prediction interval could signal a potential issue with an appliance in the area.
Sub-meter 2 with similar frequency, time period and confidence levels
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.331 -0.00289 6.20 0.991 0.504 53 -477. 1062. 1227.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Calculate RMSE
## [1] 3.490013
Insight Analysis: 1. The forecast plot above shows a trend line of the predicted values with the 80% and 90% prediction intervals. Sub-meter 2 is predicted to continue with low energy consumption; the trend shows energy consumption dropping in the near future but increasing later. 2. This prediction could warn of future energy consumption climbing back to historical levels if measures are not taken to maintain low usage.
9.0 DECOMPOSITION VISUALIZATION AND ANALYSIS OF SEASONAL TIME SERIES
According to The Little Book of R: “A seasonal time series consists of a trend component, a seasonal component and an irregular component. Decomposing the time series means separating the time series into these three components: that is, estimating these three components.” When analysing the trend of a time series independently of the seasonal components, seasonal adjustment method is used to remove the seasonal component of a time series that exhibits a seasonal pattern.
In order to correctly estimate any trend and seasonal components that might be in the time series, the decompose() function from the stats package is used. This estimates the trend, seasonal, and irregular components of a time series.
When the decompose() function is used, R returns three different objects (Seasonal component, Trend component, Random component) that can be accessed from the command line after running decompose() on the time series.
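A self-contained sketch of the call and the returned components, using a synthetic weekly series with a built-in 52-week cycle in place of the real data:

```r
# Synthetic weekly series with a clear 52-week seasonal pattern.
ts3 <- ts(5 * sin(2 * pi * (1:157) / 52) + 6, frequency = 52, start = c(2007, 1))

components <- decompose(ts3)  # from the stats package
names(components)             # "x" "seasonal" "trend" "random" "figure" "type"
plot(components)              # observed / trend / seasonal / random panels

# Each component can be accessed directly, e.g.:
head(components$seasonal)
```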
9.1 Decomposition Visualization of Sub-meter 3 9.1.1 Does Sub-meter 3 show a trend in power usage? 9.1.2 This information will be important to a homeowner trying to understand their power consumption. 9.1.3 Does Sub-meter 3 show seasonal effects on power usage towards the end of every year? What may or may not cause this?
## Length Class Mode
## x 157 ts numeric
## seasonal 157 ts numeric
## trend 157 ts numeric
## random 157 ts numeric
## figure 52 -none- numeric
## type 1 -none- character
## Time Series:
## Start = c(2007, 1)
## End = c(2010, 1)
## Frequency = 52
## [1] 6.038153 -2.961847 -2.961847 -2.961847 -2.966654 -2.976270 -2.904154
## [8] -2.740693 -2.658962 -2.668577 -2.591654 -2.514731 -2.437808 -2.360885
## [15] -1.870500 -1.375308 -1.793577 -1.846462 -1.981077 7.100653 -1.822424
## [22] -1.327231 6.667961 -2.336847 -1.841654 -1.351270 -3.061206 -3.000308
## [29] 5.672769 11.759307 -2.654154 -2.567616 5.432384 6.432384 -2.567616
## [36] -3.067616 6.018923 -2.308000 6.278538 -2.721462 -2.221462 -2.226270
## [43] 6.764115 -2.745500 -2.750308 15.244884 -2.764731 6.225653 -2.865693
## [50] -2.952231 -2.952231 7.042961
## (the 52 weekly seasonal factors above repeat identically for each year of the 157-week series)
Decomposition Insights: The plot above shows the original time series (observed), the estimated trend component (trend), the estimated seasonal component (seasonal), and the estimated irregular component (random). 1. We can see that the estimated trend component shows a large decrease during 2008, followed by a zig-zag and a slow increase from then on to about mid-2009. 2. As seen in the graph above, there’s a clear trend in energy usage in Sub-meter 3: power usage steadily decreases from its high in mid-2007 to its lowest consumption in mid-2008. 3. This information will be important to a homeowner trying to understand their power consumption. 4. The estimated seasonal factors are given for the period 2007 to 2010 and are the same for each year. The largest seasonal factor is about 15.24 and the lowest about -3.07, indicating a peak and a trough in power consumption during this period. 5. The drop in energy consumption may be caused by a change in the homeowner’s energy usage behavior or activities in the Sub-meter 3 area.
Further Visualizations and analysis:
9.2 Sub-meter 1 Decomposed plot With Same Frequency and Time Period
## Length Class Mode
## x 157 ts numeric
## seasonal 157 ts numeric
## trend 157 ts numeric
## random 157 ts numeric
## figure 52 -none- numeric
## type 1 -none- character
9.3 Sub-meter 2 Decomposed Plot With Same Frequency and Time Period
## Length Class Mode
## x 157 ts numeric
## seasonal 157 ts numeric
## trend 157 ts numeric
## random 157 ts numeric
## figure 52 -none- numeric
## type 1 -none- character
This could be caused by any number of factors, such as removal of an energy-saving appliance from the laundry area. Sub-meter 2 shows seasonal effects on power usage during the middle of each year.
HOLT-WINTERS FORECASTING
The HoltWinters() function from the stats package helps make forecasts. We can fit a simple exponential smoothing predictive model using HoltWinters() in R.
Remove Seasonal Components To use HoltWinters() for forecasting, the seasonal component identified via decomposition must first be removed through seasonal adjustment.
10.1 Seasonal Adjusting Sub-meter 3
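The adjustment is a subtraction of the estimated seasonal component; a self-contained sketch with a synthetic weekly series standing in for the Sub-meter 3 data:

```r
# Synthetic weekly series with a 52-week seasonal pattern.
ts3 <- ts(5 * sin(2 * pi * (1:157) / 52) + 6, frequency = 52, start = c(2007, 1))

# Subtract the estimated seasonal component to seasonally adjust the series.
ts3_adjusted <- ts3 - decompose(ts3)$seasonal

# Re-decomposing the adjusted series leaves a numerically negligible seasonal part.
max(abs(decompose(ts3_adjusted)$seasonal))  # on the order of 1e-15
```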
To confirm removal of the seasonal component, let’s run decompose again. Although a seasonal line is still drawn, we verify removal of seasonality by looking at the scale of the seasonal panel: values on the order of 1e-15 to 1e-12 are vanishingly small numbers. For all practical purposes, removal of the seasonality is confirmed.
HoltWinters Simple Exponential Smoothing After removing the seasonal component, we can now use the HoltWinters simple exponential smoothing function. In the plot above, the exponentially smoothed fitted line is plotted in red along with the original data points. To understand how exponential smoothing helps, consider the outliers, and consider the information removed when we subset millions of data points down to just 52 observations per year.
The plot above shows the original time series in black, and the forecasts as a red line. The time series of forecasts is much smoother than the time series of the original data here.
As a measure of the accuracy of the forecasts, we can calculate the sum of squared errors for the in-sample forecast errors, that is, the forecast errors for the time period covered by our original time series. The sum-of-squared-errors is stored in a named element of the list variable “tsSM3_HW070809”.
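The fit and the stored error element, continued on a synthetic seasonally adjusted stand-in series; `tsSM3_HW070809` follows the naming in the text, but the input here is illustrative, not the real data:

```r
set.seed(7)
# Synthetic weekly series, seasonally adjusted as in the previous step.
ts3 <- ts(5 * sin(2 * pi * (1:157) / 52) + 6 + rnorm(157, sd = 0.5),
          frequency = 52, start = c(2007, 1))
ts3_adjusted <- ts3 - decompose(ts3)$seasonal

# Simple exponential smoothing: level only (trend and seasonal terms disabled).
tsSM3_HW070809 <- HoltWinters(ts3_adjusted, beta = FALSE, gamma = FALSE)

tsSM3_HW070809$SSE   # in-sample sum of squared errors
plot(tsSM3_HW070809) # original series in black, fitted line in red
```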
Fine tuning: change the confidence levels and then plot only the forecasted area. Think of this just as when a weatherperson forecasts the weather: the preceding years, weeks, and days are not usually included in the forecast.
The resulting plot above shows a very consistent forecast for Sub-meter 3.
Further HoltWinters Visualizations with Sub-meters 1 and 2
10.2 Sub-meter 1 forecast plot and a plot containing only the forecasted area. Same frequency and time period.
To confirm removal of seasonal component.
HoltWinters Simple Exponential Smoothing after removal of the seasonal component.
The resulting image shows a very consistent forecast for Sub-meter 1.
10.3 Sub-meter 2 forecast plot and a plot containing only the forecasted area, with our choice of frequency and time period.
HoltWinters Simple Exponential Smoothing After removing the seasonal component, let’s use the HoltWinters simple exponential smoothing function.
Lastly, let’s change the confidence levels and then plot only the forecasted area.
The resulting image shows a very consistent forecast for Sub-meter 2.